
\documentclass[11pt]{article} % use larger type; default would be 10pt
\usepackage{framed}
\usepackage[utf8]{inputenc} % set input encoding (not needed with XeLaTeX)
\usepackage{geometry} % to change the page dimensions
\geometry{a4paper} % or letterpaper (US) or a5paper or....
% \geometry{margin=2in} % for example, change the margins to 2 inches all round
% \geometry{landscape} % set up the page for landscape
%   read geometry.pdf for detailed page layout information

\usepackage{graphicx} % support the \includegraphics command and options

% \usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent

%%% PACKAGES
\usepackage{booktabs} % for much better looking tables
\usepackage{array} % for better arrays (eg matrices) in maths
\usepackage{paralist} % very flexible & customisable lists (eg. enumerate/itemize, etc.)
\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
\usepackage{subfig} % make it possible to include more than one captioned figure/table in a single float
% These packages are all incorporated in the memoir class to one degree or another...

%%% HEADERS & FOOTERS
\usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
\pagestyle{fancy} % options: empty , plain , fancy
\renewcommand{\headrulewidth}{0pt} % customise the layout...
\lhead{}\chead{}\rhead{}
\lfoot{}\cfoot{\thepage}\rfoot{}

%%% SECTION TITLE APPEARANCE
\usepackage{sectsty}
\allsectionsfont{\sffamily\mdseries\upshape} % (See the fntguide.pdf for font help)
% (This matches ConTeXt defaults)

%%% ToC (table of contents) APPEARANCE
\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!
\begin{document}

\section*{Anomaly Detection}
\begin{itemize}
	\item Suppose you are developing an anomaly detection system to catch manufacturing defects in airplane engines. 
	\item Your model uses
	{ 
		\Large
		\[p(x)= \prod_{j=1}^{n} p(x_j;\mu_j,\sigma^2_j)\] 
	}
	\item You have two features $x_1$ = \textit{\textbf{vibration intensity}}, and $x_2$ = \textit{\textbf{heat generated}}. 
	\item Both $x_1$ and $x_2$ take on values between 0 and 1 (and are strictly greater than 0), and for most ``normal'' engines you expect that $x_1 \approx x_2$. 
	\item One of the suspected anomalies is that a flawed engine may vibrate very intensely even without generating much heat (large $x_1$, small $x_2$), 
	even though the particular values of $x_1$ and $x_2$ may not fall outside their typical ranges of values. 
	\item What additional feature $x_3$ should you create to capture these types of anomalies?
\end{itemize}
\textbf{Solution Options}
\begin{itemize}
	\item $x_3=x_1+x_2$: this could take on large or small values for both normal and anomalous examples, so it is not a good feature.
\end{itemize}


%--------------------------------------------------------------------%
\subsection*{Question 1. }
For which of the following problems would \textbf{anomaly detection} be a suitable algorithm?

\begin{itemize}
	\item From a large set of hospital patient records, predict which patients have a particular disease (say, the flu).
	\item From a large set of primary care patient records, identify individuals who might have unusual health conditions. [CORRECT]
	\item Given data from credit card transactions, classify each transaction according to type of purchase (for example: food, transportation, clothing).
	\item In a computer chip fabrication plant, identify microchips that might be defective. [CORRECT]
\end{itemize}

%--------------------------------------------------------------------%
\subsection*{Question 2. Variant A}
Suppose you have trained an anomaly detection system that flags anomalies when $p(x)$ is less than $\epsilon$, and you find on the cross-validation set that it has 
too many false negatives (failing to flag a lot of anomalies). What should you do?


\begin{itemize}
	\item Increase $\epsilon$ [CORRECT]
	\item Decrease $\epsilon$
\end{itemize}

\subsection*{Question 2. Variant B}
Suppose you have trained an anomaly detection system for fraud detection, and your system flags 
anomalies when $p(x)$ is less than $\epsilon$, and you find on the 
cross-validation set that it is mis-flagging far too many good transactions as fraudulent. What should you do?

\begin{itemize}
	\item Increase $\epsilon$ 
	\item Decrease $\epsilon$ [CORRECT]
\end{itemize}
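
The direction of both adjustments follows from how the set of flagged examples depends on the threshold. As a short sketch (the set notation $A(\epsilon)$ is introduced here for illustration and is not part of the quiz):
\[ A(\epsilon) = \{x : p(x) < \epsilon\}, \qquad \epsilon_1 < \epsilon_2 \;\Rightarrow\; A(\epsilon_1) \subseteq A(\epsilon_2). \]
Raising $\epsilon$ can only flag more examples (fewer false negatives, possibly more false positives), while lowering it can only flag fewer (fewer false positives, possibly more false negatives).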

%--------------------------------------------------------------------%
\subsection*{Question 3. }
Suppose you are developing an anomaly detection system to catch manufacturing defects in airplane engines. Your model uses
\[p(x)=\prod_{j=1}^{n} p(x_j;\mu_j,\sigma^2_j).\]
You have two features $x_1$ = vibration intensity, and $x_2$ = heat generated. 

Both $x_1$ and $x_2$ take on values between 0 and 1 (and are strictly greater than 0), and for most ``normal'' engines you expect that $x_1 \approx x_2$. 

One of the suspected anomalies is that a flawed engine may vibrate very intensely even without generating much heat (large $x_1$, small $x_2$), even though the particular values of $x_1$ and $x_2$ may not fall outside their typical ranges of values. What additional feature $x_3$ should you create to capture these types of anomalies?

\begin{itemize}
	\item $x_3 = x_1 / x_2$ [CORRECT]
	
	\item $x_3 = x_1^2 \times x_2$
	
	\item $x_3 = x_1 \times x_2$
	
	\item $x_3 = x_1 + x_2$
	
\end{itemize}
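
To see why the ratio separates these anomalies while the sum does not, consider two hypothetical engines (the numbers are made up for illustration): a normal one with $x_1 = x_2 = 0.5$ and a flawed one with $x_1 = 0.9$, $x_2 = 0.1$.
\[
\begin{array}{lccc}
 & x_1/x_2 & x_1 + x_2 & x_1 \times x_2 \\
\mbox{normal } (0.5,\,0.5) & 1 & 1.0 & 0.25 \\
\mbox{flawed } (0.9,\,0.1) & 9 & 1.0 & 0.09
\end{array}
\]
The sum is identical for both engines. The product does differ, but a normal engine with $x_1 = x_2 = 0.3$ would also give $0.09$, so the product does not isolate the flaw. Only the ratio moves far from its normal value of about 1.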

%--------------------------------------------------------------------%
\subsection*{Question 4. }
Which of the following are true? Check all that apply.

\begin{itemize}
	\item When evaluating an anomaly detection algorithm on the cross validation set (containing some positive and some negative examples), classification accuracy is usually a good evaluation metric to use.
	\item \textbf{CORRECT} In anomaly detection, we fit a model $p(x)$ to a set of negative ($y=0$) examples, without using any positive examples we may have collected of previously observed anomalies.
	\item \textbf{CORRECT} When developing an anomaly detection system, it is often useful to select an appropriate numerical performance metric to evaluate the effectiveness of the learning algorithm.
	\item In a typical anomaly detection setting, we have a large number of anomalous examples, and a relatively small number of normal/non-anomalous examples.
\end{itemize}
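
Classification accuracy is a poor metric in this setting because the classes are heavily skewed. As a hypothetical illustration (the counts are assumed, not taken from the quiz), take a cross-validation set of 1000 examples of which 5 are anomalies. A detector that never flags anything scores
\[ \mathrm{accuracy} = \frac{995}{1000} = 99.5\%, \qquad \mathrm{recall} = \frac{0}{5} = 0, \]
so a high accuracy says little about whether anomalies are actually caught. Metrics such as precision, recall, or the $F_1$ score are more informative here.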
%--------------------------------------------------------------------%
\subsection*{Question 5. }
You have a 1-D dataset $\{x^{(1)},\ldots,x^{(m)}\}$ and you want to detect outliers in the dataset. You first plot the dataset and it looks like this:


Suppose you fit the Gaussian distribution parameters $\mu_1$ and $\sigma^2_1$ to this dataset. Which of the following values for $\mu_1$ and $\sigma^2_1$ might you get?
\begin{itemize}
	\item $\mu_1=-3$, $\sigma^2_1=4$
	\item $\mu_1=-6$, $\sigma^2_1=4$
	\item $\mu_1=-3$, $\sigma^2_1=2$
	\item $\mu_1=-6$, $\sigma^2_1=2$
\end{itemize}



%------------------------------------------------%
% ML Week 9

\section{Week 9 Quiz. Recommender Systems}

A recommender system is an information filtering system that attempts to recommend items likely
to be of interest to a user.

\textbf{Commonly used algorithms}
\begin{itemize}
	\item $k$-means clustering
	\item Pearson's Rho
	\item Collaborative Filtering
\end{itemize}

Collaborative Filtering is the process of filtering for information or patterns using collaboration among multiple
agents.

Applications: online news aggregation, or recommending similar items of clothing.

Pure prediction problems are best approached by other methods (e.g.\ linear or logistic regression).

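\textbf{Collaborative Filtering Gradient}

\[ \frac{\partial J}{\partial x^{(i)}_k}  = \sum_{j:r(i,j)=1} \left[ (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right] \theta^{(j)}_k \]
\[ \frac{\partial J}{\partial \theta^{(j)}_k}  = \sum_{i:r(i,j)=1} \left[ (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right] x^{(i)}_k \]

No regularization applied.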
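
The unregularized collaborative filtering cost whose partial derivatives are taken here can be sketched as follows. The notation is assumed from the usual setup: $r(i,j)=1$ when user $j$ has rated item $i$, $y^{(i,j)}$ is that rating, and $\alpha$ is a learning rate introduced for illustration.
\[ J = \frac{1}{2} \sum_{(i,j):r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 \]
A gradient descent step then updates every feature and parameter at once:
\[ x^{(i)}_k := x^{(i)}_k - \alpha \frac{\partial J}{\partial x^{(i)}_k}, \qquad \theta^{(j)}_k := \theta^{(j)}_k - \alpha \frac{\partial J}{\partial \theta^{(j)}_k}. \]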

%--------------------------------------%
\textbf{Anomaly Detection: Estimating the Gaussian Distribution}

For the $n$ features of $X$, compute the mean and variance of each feature.
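
Concretely, for a training set $\{x^{(1)},\ldots,x^{(m)}\}$ the per-feature estimates are usually taken to be the maximum-likelihood ones. This is a sketch: the $1/m$ normalization is assumed here, whereas some texts use $1/(m-1)$ for the variance.
\[ \mu_j = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}_j, \qquad \sigma^2_j = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)}_j - \mu_j\right)^2, \qquad j = 1,\ldots,n. \]
Each factor of $p(x)$ is then the Gaussian density
\[ p(x_j;\mu_j,\sigma^2_j) = \frac{1}{\sqrt{2\pi\sigma^2_j}} \exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma^2_j}\right). \]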

%-----------%
\textbf{Selecting the Threshold $\epsilon$}

Implement an algorithm to select the threshold $\epsilon$ using the $F_1$ score on a 
cross-validation set.

An example $x$ with $p(x) < \epsilon$ is considered to be an anomaly.
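
The $F_1$ score combines precision and recall, computed from the counts of true positives, false positives and false negatives on the cross-validation set. The symbols $tp$, $fp$ and $fn$ are shorthand introduced here for those counts:
\[ \mathrm{prec} = \frac{tp}{tp+fp}, \qquad \mathrm{rec} = \frac{tp}{tp+fn}, \qquad F_1 = \frac{2\,\mathrm{prec}\cdot\mathrm{rec}}{\mathrm{prec}+\mathrm{rec}}. \]
One simple approach, given as a sketch rather than a prescribed procedure, is to try a range of candidate thresholds between the smallest and largest value of $p(x)$ on the cross-validation set and keep the one with the highest $F_1$.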




\section*{Recommender Systems}


%=====================================================================%

\subsection*{Question 1 }  
Suppose you run a bookstore, and have ratings (1 to 5 stars) of books. Your collaborative filtering algorithm has learned a parameter vector $\theta^{(j)}$ for user $j$, and a feature vector $x^{(i)}$ for each book. You would like to compute the ``training error'', meaning the average squared error of your system's predictions on all the ratings that you have gotten from your users. Which of these are correct ways of doing so (check all that apply)?

For this problem, let $m$ be the total number of ratings you have gotten from your users. 
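(Another way of saying this is that $m=\sum_{i=1}^{n_m}\sum_{j=1}^{n_u} r(i,j)$.) [Hint: Two of the four options below are correct.]

\begin{itemize}
\item SELECTED $\frac{1}{m}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - r(i,j)\right)^2$

\item $\frac{1}{m}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2$

\item SELECTED $\frac{1}{m}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left(\sum_{k=1}^{n}(\theta^{(j)})_k x^{(i)}_k - y^{(i,j)}\right)^2$

\item $\frac{1}{m}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left(\sum_{k=1}^{n}(\theta^{(k)})_j x^{(k)}_i - y^{(i,j)}\right)^2$
\end{itemize}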
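
The prediction for book $i$ by user $j$ is the inner product of the user's parameter vector and the book's feature vector, which can be written in vector form or as an explicit sum over the $n$ features. As a sketch, with $y^{(i,j)}$ denoting the rating user $j$ gave book $i$ and $r(i,j)=1$ marking the pairs that have a rating (notation assumed from the usual collaborative filtering setup):
\[ (\theta^{(j)})^T x^{(i)} = \sum_{k=1}^{n} \theta^{(j)}_k x^{(i)}_k, \qquad \mbox{training error} = \frac{1}{m} \sum_{(i,j):r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 . \]
The error compares each prediction with the actual rating $y^{(i,j)}$, not with the indicator $r(i,j)$.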

%=====================================================================%

\subsection*{Question 2. } 

In which of the following situations will a collaborative filtering system be the most appropriate learning algorithm (compared to linear or logistic regression)?

\begin{itemize}
\item WRONG You're an artist and hand-paint portraits for your clients. Each client gets a different portrait (of themselves) and gives you 1--5 star rating feedback, and each client purchases at most 1 portrait. You'd like to predict what rating your next customer will give you.

\item SELECTED You manage an online bookstore and you have the book ratings from many users. You want to learn to predict the expected sales volume (number of books sold) as a function of the average rating of a book.

\item SELECTED You own a clothing store that sells many styles and brands of jeans. You have collected reviews of the different styles and brands from frequent shoppers, and you want to use these reviews to offer those shoppers discounts on the jeans you think they are most likely to purchase.

\item WRONG You run an online bookstore and collect the ratings of many users. You want to use this to identify what books are ``similar'' to each other (i.e., if one user likes a certain book, what are other books that she might also like?)
\end{itemize}
%=====================================================================%
\subsection*{Question 3 }  

You run a movie empire, and want to build a movie recommendation system based on collaborative filtering. There were three popular review websites (which we'll call A, B and C) that users go to in order to rate movies, and you have just acquired all three companies that run these websites. You'd like to merge the three companies' datasets together to build a single, unified system. On website A, users rank a movie as having 1 through 5 stars. On website B, users rank on a scale of 1--10, and decimal values (e.g., 7.5) are allowed. On website C, the ratings are from 1 to 100. You also have enough information to identify users/movies on one website with users/movies on a different website. Which of the following statements is true?

\begin{itemize}
\item You can combine all three training sets into one without any modification and expect high performance from a recommendation system.

\item It is not possible to combine these websites' data. You must build three separate recommendation systems.

\item CORRECT You can merge the three datasets into one, but you should first normalize each dataset separately by subtracting the mean and then dividing by (max - min), where (max - min) is (5-1), (10-1) or (100-1) for the three websites respectively.

\item You can combine all three training sets into one as long as you perform mean normalization and feature scaling after you merge the data.

\end{itemize}
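
As a worked illustration of normalizing each site separately, suppose (hypothetically) that the mean rating is 3 on website A and 5.5 on website B. A 4-star rating on A and a rating of 7.5 on B then map to
\[ \frac{4 - 3}{5 - 1} = 0.25, \qquad \frac{7.5 - 5.5}{10 - 1} \approx 0.22, \]
so both land on a comparable scale, whereas the raw values 4 and 7.5 cannot be compared directly.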

%=====================================================================%
\subsection*{Question 4 } 
Which of the following are true of collaborative filtering systems? Check all that apply.

\begin{itemize}

\item WRONG Suppose you are writing a recommender system to predict a user's book preferences. In order to build such a system, you need that user to rate all the other books in your training set.

\item CORRECT For collaborative filtering, it is possible to use one of the advanced optimization algorithms (L-BFGS/conjugate gradient/etc.) to solve for both the $x^{(i)}$'s and $\theta^{(j)}$'s simultaneously.

\item CORRECT Even if each user has rated only a small fraction of all of your products (so $r(i,j)=0$ for the vast majority of $(i,j)$ pairs), you can still build a recommender system by using collaborative filtering.

\item WRONG For collaborative filtering, the optimization algorithm you should use is gradient descent. In particular, you cannot use more advanced optimization algorithms (L-BFGS/conjugate gradient/etc.) for collaborative filtering, since you have to solve for both the $x^{(i)}$'s and $\theta^{(j)}$'s simultaneously.

\end{itemize}

%=====================================================================%

\subsection*{Question 5 } 

Suppose you have two matrices $A$ and $B$, where $A$ is $5 \times 3$ and $B$ is $3 \times 5$. Their product is $C=AB$, a $5 \times 5$ matrix. Furthermore, you have a $5 \times 5$ matrix $R$ where every entry is 0 or 1. You want to find the sum of all elements $C(i,j)$ for which the corresponding $R(i,j)$ is 1, and ignore all elements $C(i,j)$ where $R(i,j)=0$. One way to do so is the following code:


Which of the following pieces of Octave code will also correctly compute this total? Check all that apply. Assume all options are in code.

\begin{itemize}

\item CORRECT total = sum(sum((A * B) .* R))

\item CORRECT C = (A * B) .* R; total = sum(C(:));

\item WRONG total = sum(sum((A * B) * R));

\item WRONG C = (A * B) * R; total = sum(C(:));
\end{itemize}
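
In index notation, the quantity being computed is a sum of the entries of $C$ masked by $R$. Because every $R(i,j)$ is 0 or 1, multiplying entry by entry and summing everything gives the same total. This is a short sketch, using $\circ$ as assumed notation for the element-wise product that Octave writes as \verb|.*|:
\[ \mathrm{total} = \sum_{(i,j):R(i,j)=1} C(i,j) = \sum_{i=1}^{5}\sum_{j=1}^{5} R(i,j)\,C(i,j) = \sum_{i=1}^{5}\sum_{j=1}^{5} \left( (AB) \circ R \right)(i,j). \]
By contrast, \verb|(A * B) * R| is a matrix product whose $(i,j)$ entry is $\sum_{k=1}^{5} C(i,k)\,R(k,j)$, so it mixes entries across each row of $C$ instead of masking them.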
%=====================================================================%

Q1

For which of the following problems would anomaly detection be a suitable algorithm?

SELECTED In a computer chip fabrication plant, identify microchips that might be defective.

Given data from credit card transactions, classify each transaction according to type of purchase (for example: food, transportation, clothing).

From a large set of hospital patient records, predict which patients have a particular disease (say, the flu).

SELECTED From a large set of primary care patient records, identify individuals who might have unusual health conditions.


%------------------------------------------------------------------------%

Q2

Suppose you have trained an anomaly detection system that flags anomalies when $p(x)$ is less than $\epsilon$, and you find on the cross-validation set that it has too many false positives (flagging too many things as anomalies). What should you do?

SELECTED Decrease $\epsilon$

Increase $\epsilon$

% ------------------------------------------------------------------------%


Q3


$x_3 = x_1/x_2$

% ------------------------------------------------------------------------%


Q4

In anomaly detection, we fit a model p(x) to a set of negative (y=0) examples, without using any positive examples we may have collected of previously observed anomalies.

When evaluating an anomaly detection algorithm on the cross validation set (containing some positive and some negative examples), classification accuracy is usually a good evaluation metric to use.

When developing an anomaly detection system, it is often useful to select an appropriate numerical performance metric to evaluate the effectiveness of the learning algorithm.

In a typical anomaly detection setting, we have a large number of anomalous examples, and a relatively small number of normal/non-anomalous examples.



Q5 - $\mu_1=-3$, $\sigma^2_1=4$


\end{document}
